Chapter 4 Exploratory Data Analysis

4.1 Start with dplyr counts and summaries in console

  • In his Tidy Tuesday live coding videos, David Robinson usually starts exploring new data with dplyr::count() in the console. I recommend this as the first step in your EDA too.

  • In the code below we don’t use the package name in the console (so we are breaking rule 1 that I just told you). We don’t need the package name because we won’t save this code for others to read. This means we can type more quickly and explore the data with dplyr verbs faster.
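  • A minimal sketch of this kind of console exploration. It assumes the Texas housing sales data used later in this chapter is ggplot2::txhousing; the name df and the summary column names are hypothetical.

```r
library(dplyr)

# Assumption: the Texas housing sales data is ggplot2::txhousing
df <- ggplot2::txhousing

# How many rows does each city have?
city_counts <- df %>% count(city, sort = TRUE)
city_counts

# How many monthly records are there per year?
df %>% count(year)

# Total and typical monthly sales per city, ignoring missing values
city_sales <- df %>%
  group_by(city) %>%
  summarise(
    total_sales  = sum(sales, na.rm = TRUE),
    median_sales = median(sales, na.rm = TRUE)
  ) %>%
  arrange(desc(total_sales))
city_sales
```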

4.2 Plot data points with geom_point()

  • After using dplyr::count(), dplyr::group_by() and dplyr::summarise(), try plotting the data points with ggplot2::geom_point(). It almost NEVER fails to show you what’s going on quickly. And it is unlikely to return a confusing ggplot error message.

  • ggplot2::geom_point() is the simplest and most reliable ggplot plot type (or geom) to start visualising data. Increasingly, I’m finding ggplot2::geom_point() is often a good choice to include in the final plot too.

  • Let’s look at all the values of sales for each date.
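  • A sketch of that first plot, assuming the data is ggplot2::txhousing (df and p_date are hypothetical names). Rows with a missing sales value are dropped with a warning when the plot is drawn.

```r
# Assumption: the Texas housing sales data is ggplot2::txhousing
df <- ggplot2::txhousing

# One point per city and month: date on the x axis, sales on the y axis
p_date <- ggplot2::ggplot(df, ggplot2::aes(x = date, y = sales)) +
  ggplot2::geom_point()
p_date
```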

## Warning: Removed 568 rows containing missing values (geom_point).

  • Now let’s look at the individual sales values for each city.
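  • One possible version, again assuming ggplot2::txhousing. Putting city on the y axis is my choice here so that the city names stay readable; p_city is a hypothetical name.

```r
# Assumption: the Texas housing sales data is ggplot2::txhousing
df <- ggplot2::txhousing

# One row of points per city, with sales along the x axis
p_city <- ggplot2::ggplot(df, ggplot2::aes(x = sales, y = city)) +
  ggplot2::geom_point()
p_city
```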
## Warning: Removed 568 rows containing missing values (geom_point).

  • With so many data points to plot, they overlap to create thick dark lines that might hide interesting variation. This is known as overplotting. Reduce overplotting by replacing ggplot2::geom_point() with ggplot2::geom_jitter(). It randomly jitters the location of the data points by a small amount so that fewer points overlap.

  • Sometimes there are so many data points that jittering does not reduce overplotting enough. You can also make the data points lighter using the alpha argument of ggplot2::geom_jitter() as in the code below. The lower the value of alpha, the fainter the points.
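  • A sketch of jittering with faint points, assuming ggplot2::txhousing. The alpha and height values are illustrative guesses to tune by eye; width = 0 is my choice so that the sales values themselves are not moved.

```r
# Assumption: the Texas housing sales data is ggplot2::txhousing
df <- ggplot2::txhousing

# Jitter only along the city axis and make each point mostly transparent
p_jitter <- ggplot2::ggplot(df, ggplot2::aes(x = sales, y = city)) +
  ggplot2::geom_jitter(alpha = 0.1, width = 0, height = 0.3)
p_jitter
```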

## Warning: Removed 568 rows containing missing values (geom_point).

  • Hadley Wickham has a few more tricks you can use in the free to access overplotting chapter of his ggplot2: Elegant Graphics for Data Analysis book.

  • We all know sales of most things vary by the time of year. So let’s now put date on the x axis, make city the colour, and because the data is over time we can join the data points by using ggplot2::geom_line().

  • We also use the reduced dataframe with fewer cities to create a plot that is less crowded.
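  • A sketch of the line plot. It assumes ggplot2::txhousing; the five cities are an arbitrary selection of mine, and df_red and p_line are hypothetical names.

```r
library(dplyr)

# Assumption: ggplot2::txhousing is the data; the cities are an arbitrary pick
df_red <- ggplot2::txhousing %>%
  dplyr::filter(city %in% c("Abilene", "Austin", "Dallas", "Houston", "San Antonio"))

# Date on the x axis, one coloured line per city
p_line <- ggplot2::ggplot(df_red, ggplot2::aes(x = date, y = sales, colour = city)) +
  ggplot2::geom_line()
p_line
```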

## Warning: Removed 1 rows containing missing values (geom_path).

  • Beautiful. While sales volumes differ greatly between cities, we can see they all follow a similar seasonal pattern. To bring the patterns of sales closer to each other so that they are easier to compare, we can transform the sales value by showing the log of sales. This is Hadley Wickham’s approach in ggplot2: Elegant Graphics for Data Analysis.
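  • A sketch of the log transformation, assuming ggplot2::txhousing and an arbitrary selection of five cities (df_red and p_log are hypothetical names).

```r
library(dplyr)

# Assumption: ggplot2::txhousing is the data; the cities are an arbitrary pick
df_red <- ggplot2::txhousing %>%
  dplyr::filter(city %in% c("Abilene", "Austin", "Dallas", "Houston", "San Antonio"))

# Plot log(sales) so that large and small cities sit closer together
p_log <- ggplot2::ggplot(df_red, ggplot2::aes(x = date, y = log(sales), colour = city)) +
  ggplot2::geom_line()
p_log
```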

  • Wickham goes on to model the Texas housing sales data by fitting a linear model between the log of sales and the month, then plotting the residuals (i.e. the change in sales not explained by the month). This removes the strong seasonal effects. This is similar to the decomposition part of a classic time series analysis. Take a look at the recent fable forecasting package for another way to decompose a time series.
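  • A sketch of that modelling step, following the approach described in Wickham’s book as I understand it: fit log sales against month for each city and keep the residuals. It assumes ggplot2::txhousing; df_deseas and p_deseas are hypothetical names.

```r
library(dplyr)

# Assumption: the Texas housing sales data is ggplot2::txhousing
df <- ggplot2::txhousing

# Residuals of a linear model of x on month. na.exclude pads the residuals
# with NA so the result has the same length as x.
deseas <- function(x, month) {
  resid(lm(x ~ factor(month), na.action = na.exclude))
}

# Fit one model per city and store the residuals as rel_sales
df_deseas <- df %>%
  dplyr::group_by(city) %>%
  dplyr::mutate(rel_sales = deseas(log(sales), month)) %>%
  dplyr::ungroup()

# One faint line per city with the monthly effect removed
p_deseas <- ggplot2::ggplot(df_deseas, ggplot2::aes(x = date, y = rel_sales)) +
  ggplot2::geom_line(ggplot2::aes(group = city), alpha = 0.2, na.rm = TRUE)
p_deseas
```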

  • We will take a visual approach to reduce the seasonal effect in the Polish your final plot part of this chapter. The entire time series is simply plotted zoomed out with years clearly marked so that you can easily see the strong monthly sales pattern repeated each year.

## Warning: Removed 1 rows containing missing values (geom_path).

4.3 Facet by categories

  • So far we have shown the different sales patterns for each city by putting city into the colour argument of ggplot. However, with lots of cities the plot gets too crowded. When you have many categories to compare like this, facets or “small multiples” are a good choice. This is a fancy way of saying draw a chart for each value in one or more columns, then look at all the plots at once, usually in a grid.

  • An important setting for facets is the scales argument. Set it to "free" like this: scales = "free". Each plot will then have its own axis scales fitted to that city’s data, rather than sharing one scale across all cities. This lets us more easily spot interesting differences or similarities in the patterns over time between each city.
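  • A sketch of faceting by city with free scales, assuming ggplot2::txhousing (p_facet is a hypothetical name).

```r
# Assumption: the Texas housing sales data is ggplot2::txhousing
df <- ggplot2::txhousing

# One small panel per city, each with its own x and y scales
p_facet <- ggplot2::ggplot(df, ggplot2::aes(x = date, y = sales)) +
  ggplot2::geom_line() +
  ggplot2::facet_wrap(~ city, scales = "free")
p_facet
```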

## Warning: Removed 1 rows containing missing values (geom_path).

4.4 Facet interactively (trelliscopejs)

  • You can also facet and explore your data interactively with a GUI using trelliscopejs piped onto the end of your ggplot. Below we facet all the Texas cities in a trelliscopejs web page. Have a play with all the settings in the chart below to see what it does.
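  • A hedged sketch of the interactive version. It assumes ggplot2::txhousing and that the trelliscopejs package is installed; the grid size of 2 rows by 4 columns is an arbitrary choice, and p_base and p_trellis are hypothetical names. The display only opens in an interactive session.

```r
# Assumption: the Texas housing sales data is ggplot2::txhousing
df <- ggplot2::txhousing

# Base plot without any faceting
p_base <- ggplot2::ggplot(df, ggplot2::aes(x = date, y = sales)) +
  ggplot2::geom_line()

# Add the trelliscopejs facet only if the package is available
if (requireNamespace("trelliscopejs", quietly = TRUE)) {
  p_trellis <- p_base +
    trelliscopejs::facet_trelliscope(~ city, nrow = 2, ncol = 4, scales = "free")
  # Printing builds the interactive display and opens it in the viewer
  if (interactive()) print(p_trellis)
}
```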

4.5 Loop to plot every category separately

  • To study each city as a full single chart on its own we can loop through the cities and plot them automatically with very little code.

  • To do this we “nest” a dataframe for each city into one dataframe. Then loop through each of the nested city dataframes creating a plot for each one.

  • Here we use dplyr::group_by() on city, then use tidyr::nest() to create the individual nested dataframes.
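  • A sketch of the nesting step. It assumes ggplot2::txhousing and an arbitrary selection of five cities; df_red is a hypothetical name, while df_red_nest is the name used in this chapter.

```r
library(dplyr)

# Assumption: ggplot2::txhousing is the data; the cities are an arbitrary pick
df_red <- ggplot2::txhousing %>%
  dplyr::filter(city %in% c("Abilene", "Austin", "Dallas", "Houston", "San Antonio"))

# One row per city, with that city's rows packed into a list column called data
df_red_nest <- df_red %>%
  dplyr::group_by(city) %>%
  tidyr::nest()
df_red_nest

# In an interactive session you could also run: df_red_nest %>% View()
```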

  • If we pipe the nested dataframe into View() this shows us what a nested dataframe looks like.
  • We can also view one of the nested data frames using square brackets. Think of the numbers in the square brackets like the co-ordinates in Excel. The first number is the column position and the second number is the row position. In the example below we view the second column of the nested dataframe [[2]] and the first row of that column [[1]].
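  • A sketch of that lookup, assuming ggplot2::txhousing and an arbitrary selection of five cities. city_one is a hypothetical name.

```r
library(dplyr)

# Assumption: ggplot2::txhousing is the data; the cities are an arbitrary pick
df_red_nest <- ggplot2::txhousing %>%
  dplyr::filter(city %in% c("Abilene", "Austin", "Dallas", "Houston", "San Antonio")) %>%
  dplyr::group_by(city) %>%
  tidyr::nest()

# [[2]] picks the second column (data); [[1]] picks its first row,
# which is the nested data frame for the first city
city_one <- df_red_nest[[2]][[1]]
city_one
```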
  • Once you start to get comfortable with how square brackets work, they are a powerful way to navigate through nested R objects. The code above showed us the data in the first nested data frame of one city. The code below now drills down into the first column and first row of that single nested dataframe. This is just one cell so is the most granular drill down possible for the object df_red_nest.
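  • A sketch of the drill down to a single cell, under the same assumptions (ggplot2::txhousing, five arbitrary cities). first_cell is a hypothetical name.

```r
library(dplyr)

# Assumption: ggplot2::txhousing is the data; the cities are an arbitrary pick
df_red_nest <- ggplot2::txhousing %>%
  dplyr::filter(city %in% c("Abilene", "Austin", "Dallas", "Houston", "San Antonio")) %>%
  dplyr::group_by(city) %>%
  tidyr::nest()

# Second column, first row gives one nested data frame;
# then its first column and the first value in that column
first_cell <- df_red_nest[[2]][[1]][[1]][[1]]
first_cell
```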
  • We can now add a plot to each nested data frame for each city. We use purrr::map2(). This purrr function is a compact way to loop through two arguments in another function. The function arguments of ggplot2::ggplot() being set by purrr::map2() are the .x and .y values. They are the nested dataframes in the data column, and the city names inside the city column of the nested dataframe df_red_nest respectively.
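  • A sketch of the purrr::map2() step, assuming ggplot2::txhousing and five arbitrary cities. Using the city name as the plot title and ungrouping before the mutate are my choices, not necessarily how the original code was written.

```r
library(dplyr)

# Assumption: ggplot2::txhousing is the data; the cities are an arbitrary pick
df_red_nest <- ggplot2::txhousing %>%
  dplyr::filter(city %in% c("Abilene", "Austin", "Dallas", "Houston", "San Antonio")) %>%
  dplyr::group_by(city) %>%
  tidyr::nest() %>%
  dplyr::ungroup()

# .x is each nested data frame, .y is the matching city name
df_red_nest <- df_red_nest %>%
  dplyr::mutate(
    plot = purrr::map2(
      data, city,
      ~ ggplot2::ggplot(data = .x, ggplot2::aes(x = date, y = sales)) +
        ggplot2::geom_line() +
        ggplot2::ggtitle(.y)
    )
  )
df_red_nest
```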
  • Take a look at the new nested data frame with a new column added containing a plot for each city.
  • Let’s also look at the information held for one of the plots, again using values in square brackets. The code below shows you that the plot is a series of nested lists that describe every element of the plot.
  • Finally, let’s print every plot quite simply with this code.
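  • A sketch of printing every plot, under the same assumptions as before (ggplot2::txhousing, five arbitrary cities, city name as title). purrr::walk() is my choice for the loop here; typing the plot column name in the console would print the list too.

```r
library(dplyr)

# Assumption: ggplot2::txhousing is the data; the cities are an arbitrary pick
df_red_nest <- ggplot2::txhousing %>%
  dplyr::filter(city %in% c("Abilene", "Austin", "Dallas", "Houston", "San Antonio")) %>%
  dplyr::group_by(city) %>%
  tidyr::nest() %>%
  dplyr::ungroup() %>%
  dplyr::mutate(
    plot = purrr::map2(
      data, city,
      ~ ggplot2::ggplot(data = .x, ggplot2::aes(x = date, y = sales)) +
        ggplot2::geom_line() +
        ggplot2::ggtitle(.y)
    )
  )

# Print each plot in turn, one full chart per city
purrr::walk(df_red_nest$plot, print)
```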
[Output: 14 plots printed in turn, one per list element, [[1]] to [[14]]. Plot [[5]] raised the warning: Removed 1 rows containing missing values (geom_path).]

4.6 Polish your final plot

  • We now have a bare minimum Exploratory Data Analysis toolkit. We explore the data with dplyr::count() from the console then visualise it with ggplot2::geom_point().

  • From exploring the data quickly we are soon ready to select a plot that tells an interesting story. But adding all the bells and whistles to make the final plot for a customer or publication can and does take a long time. So this polish shouldn’t be part of your exploratory data analysis.

  • Also, make sure the polishing is done with the clean Code style recommended earlier. You will find it quicker to comment out or tweak specific parts of your plot code until it looks just right. Clean code is faster to iterate.

  • The plot below isn’t perfect. There may be things you want to change depending both on what story you want to tell and your personal style.

  • How did I write this code? By Googling for what I wanted to do (e.g. “ggplot remove axis grid lines”), copying the code from a Stack Overflow answer, then pasting the code into a clear structure as below.

  • Many of the tweaks or polish will be to ggplot2::theme() or the ggplot2 scale functions, but are you really going to remember the ggplot code you need for every adjustment you want to make? I no longer worry about remembering how to do it. I just focus on how I want it to look and get absorbed in the creation and satisfaction of the plot gradually improving.

  • After you have built a few of your own publication quality plots with clear code you will soon be using your own plot code as a store of examples to re-use, meaning that you will Google less.

  • Be prepared for this code tweaking and plot polishing to take longer than you planned. Always.
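  • A sketch of the kind of clear structure I mean, with one tweak per line so each is easy to comment out. This is not the plot from this chapter: it assumes ggplot2::txhousing and five arbitrary cities, and every label and theme choice is illustrative. Note that date in txhousing is a decimal year, so a continuous x scale is used to mark each year.

```r
library(dplyr)

# Assumption: ggplot2::txhousing is the data; the cities are an arbitrary pick
df_red <- ggplot2::txhousing %>%
  dplyr::filter(city %in% c("Abilene", "Austin", "Dallas", "Houston", "San Antonio"))

p_final <- ggplot2::ggplot(df_red, ggplot2::aes(x = date, y = sales, colour = city)) +
  ggplot2::geom_line(na.rm = TRUE) +
  # Scales: mark every year and add thousands separators to sales
  ggplot2::scale_x_continuous(breaks = seq(2000, 2015, by = 1)) +
  ggplot2::scale_y_continuous(labels = scales::label_comma()) +
  # Labels
  ggplot2::labs(
    title  = "Monthly housing sales in five Texas cities",
    x      = NULL,
    y      = "Sales",
    colour = NULL
  ) +
  # Theme: start minimal, then adjust one element per line
  ggplot2::theme_minimal() +
  ggplot2::theme(
    panel.grid.minor = ggplot2::element_blank(),
    legend.position  = "bottom"
  )
p_final
```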

## label_key: city
## Saving 7 x 5 in image
## Warning: package 'gdtools' was built under R version 3.6.1
## Warning: Removed 430 rows containing missing values (geom_path).